3.4 Q5: Exploring the primary topics
What are the primary topics being discussed in Dogecoin-related posts, and what insights can topic modeling provide?
NLP pipeline
In our text data processing workflow, we leveraged the Spark NLP library from John Snow Labs to construct a comprehensive pipeline for cleansing and standardizing our text data. The steps included in this pipeline are as follows:
1. DocumentAssembler(): This is the initial stage that transforms raw input text into a format that Spark NLP can use, converting it into annotated documents.
2. Tokenizer(): This stage segments the text into individual elements or tokens, usually words, which are the basic units for NLP tasks.
3. Normalizer(): Here, normalization techniques are applied to standardize the text. This includes converting all characters to lowercase, removing punctuation and special characters, and other cleaning procedures.
4. StopWordsCleaner(): This component removes stopwords, which are commonly occurring words that offer little value for many analytical purposes, such as “is,” “and,” or “the.”
5. Stemmer(): The stemmer reduces words to their base or root form, which consolidates variations of a word into a single representative form.
6. Finisher(): As the terminal component, the Finisher extracts the processed data from Spark NLP’s annotation format and converts it back into a plain array of tokens, suitable for further analysis or machine learning tasks.
As we construct our machine learning pipeline, these stages are executed in sequence, ensuring a smooth flow from raw text to a cleansed and standardized token array.
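The six stages above can be sketched as a single pipeline definition. This is a minimal sketch rather than our exact configuration: the column names (`text`, `document`, `token`, `normalized`, `clean`, `stem`, `tokens`) and the helper name `build_text_pipeline` are assumptions made for illustration, and the code needs `pyspark` and `spark-nlp` installed with an active Spark NLP session.

```python
# Order of the stages as described in the text.
STAGE_ORDER = [
    "DocumentAssembler", "Tokenizer", "Normalizer",
    "StopWordsCleaner", "Stemmer", "Finisher",
]


def build_text_pipeline(input_col="text", output_col="tokens"):
    """Assemble the six Spark NLP stages into one pyspark.ml Pipeline.

    Column names are illustrative assumptions. Imports are kept inside the
    function so this module can be loaded without Spark being available.
    """
    from pyspark.ml import Pipeline
    from sparknlp.annotator import Normalizer, Stemmer, StopWordsCleaner, Tokenizer
    from sparknlp.base import DocumentAssembler, Finisher

    # Raw string column -> annotated document
    document = DocumentAssembler().setInputCol(input_col).setOutputCol("document")
    # Document -> tokens
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    # Lowercase and strip punctuation / special characters
    normalizer = (
        Normalizer()
        .setInputCols(["token"])
        .setOutputCol("normalized")
        .setLowercase(True)
    )
    # Drop stopwords such as "is", "and", "the"
    stopwords = (
        StopWordsCleaner()
        .setInputCols(["normalized"])
        .setOutputCol("clean")
        .setCaseSensitive(False)
    )
    # Reduce each remaining token to its stem
    stemmer = Stemmer().setInputCols(["clean"]).setOutputCol("stem")
    # Convert annotations back into a plain array of strings
    finisher = (
        Finisher()
        .setInputCols(["stem"])
        .setOutputCols([output_col])
        .setOutputAsArray(True)
    )
    return Pipeline(stages=[document, tokenizer, normalizer, stopwords, stemmer, finisher])
```

With a DataFrame `df` that has a string column `text` (a hypothetical name), the pipeline would be applied as `build_text_pipeline().fit(df).transform(df)`, which adds a `tokens` array column.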
Topic Modeling
Topic modeling on Dogecoin-related discussions uncovers several primary topics, including the belief that Dogecoin will keep rising, other meme coins, the use of wallets and brokers, Elon Musk’s Twitter activity, and newly launched cryptocurrencies. These insights highlight the multifaceted nature of the discourse, which extends beyond pure investment discussion.
To address this question, we implemented a Latent Dirichlet Allocation (LDA) model, a probabilistic approach to topic modeling that uncovers latent topics within a corpus of documents.
LDA posits that each document is a mixture of a small number of topics, with each topic being a probability distribution over words. The model represents each document through its topic distribution and each topic through its word distribution. By estimating these distributions, LDA can identify the predominant topics across the documents, even when these topics are not explicitly stated in the text. Its main strengths are that it is unsupervised and yields interpretable word lists per topic; its main limitations are that the number of topics must be chosen in advance and that it treats each document as a bag of words, ignoring word order.
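The two kinds of distribution described above can be made concrete with a toy example. The sketch below is not LDA fitting itself: the topic names, words, and probabilities are made-up values chosen by hand, and the function only shows how a document's topic mixture and the topics' word distributions combine into word probabilities for that document.

```python
# Two hypothetical topics, each a probability distribution over a toy vocabulary.
topic_word = {
    "holding": {"doge": 0.5, "hold": 0.3, "moon": 0.2},
    "wallets": {"doge": 0.2, "wallet": 0.5, "robinhood": 0.3},
}

# One hypothetical document, described by its mixture over the two topics.
doc_topic = {"holding": 0.75, "wallets": 0.25}


def word_probabilities(doc_topic, topic_word):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    probs = {}
    for topic, topic_weight in doc_topic.items():
        for word, word_weight in topic_word[topic].items():
            probs[word] = probs.get(word, 0.0) + topic_weight * word_weight
    return probs


probs = word_probabilities(doc_topic, topic_word)
for word, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{word}: {p:.3f}")
```

In a fitted model both sets of distributions are estimated from the corpus rather than specified by hand; the mixture step shown here is the part that stays the same.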
We used the LDA class from pyspark.ml.clustering to build our model, applying it to the cleaned tokens produced by our NLP pipeline. For posts, the model was configured to identify five distinct topics (numbered 0 to 4 in Table 3).
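A minimal sketch of this modeling step follows. It assumes the cleaned tokens sit in an array column named `tokens` and uses `CountVectorizer` to turn them into term-count vectors before fitting; the helper names `fit_lda` and `top_terms`, the column names, and the `maxIter` and `seed` values are illustrative assumptions, not our recorded settings.

```python
def top_terms(term_indices, vocabulary):
    """Translate the term indices returned by describeTopics() into words."""
    return [vocabulary[i] for i in term_indices]


def fit_lda(token_df, k, tokens_col="tokens", max_terms=10, seed=42):
    """Fit LDA on an array-of-tokens column; return (model, {topic: words}).

    Requires pyspark. Imports are local so this module loads without Spark.
    """
    from pyspark.ml.clustering import LDA
    from pyspark.ml.feature import CountVectorizer

    # LDA expects term-count vectors, so the token arrays are vectorized first.
    cv_model = CountVectorizer(inputCol=tokens_col, outputCol="features").fit(token_df)
    vectorized = cv_model.transform(token_df)

    lda_model = LDA(k=k, maxIter=20, seed=seed, featuresCol="features").fit(vectorized)

    # describeTopics() yields one row per topic with term indices and weights.
    described = lda_model.describeTopics(maxTermsPerTopic=max_terms).collect()
    topics = {
        row["topic"]: top_terms(row["termIndices"], cv_model.vocabulary)
        for row in described
    }
    return lda_model, topics
```

The returned dictionary maps each topic index to its highest-weighted words, which is the form reported in the "Topic Words" column of the tables below.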
Table 3: Topic modeling results for posts
Topic | Topic Words | Summary |
0 | | Belief that Doge will keep rising; users are encouraged to hold |
1 | | Discussion of other meme coins |
2 | doge , crypto , gui , wallet , robinhood , happe , dont , tip , think | Use of cold wallets and brokers such as Robinhood |
3 | | Elon Musk’s Twitter content |
4 | | Newly launched cryptocurrencies |
Interpreting the results of topic modeling is not straightforward, because the model returns only a bag of words for each topic, without an explicit label or meaning. However, it is possible to make an informed guess about what these ‘topics’ refer to, as outlined below. Topics 1 to 5 below correspond to rows 0 to 4 of Table 3.
Topic 1: Dogecoin Enthusiasm and Investment Strategies
As the popularity of Dogecoin continues to rise, many supporters believe in its long-term growth and encourage holding the cryptocurrency as an investment. This topic likely covers the sentiments and strategies of crypto investors who are optimistic about Dogecoin’s potential to reach new heights, commonly referred to as going “to the moon.”
Topic 2: Exploring the World of Meme Coins Beyond Dogecoin
This topic appears to cover other meme coins such as Banano, discussing their potential and place in the crypto market. These lesser-known coins often foster unique communities.
Topic 3: Practical Tips on Using Cryptocurrency Wallets and Brokers
This topic appears to focus on the practical aspects of managing cryptocurrencies, namely the use of cold wallets for security and of brokers like Robinhood for transactions.
Topic 4: Influence of Elon Musk on Dogecoin
Elon Musk’s engagement with Dogecoin through Twitter has significantly influenced its market movements. This topic indicates that Elon Musk, likely through tweets and public endorsements, is a key figure in Dogecoin’s discourse. Discussions often speculate on his future involvement and its potential impact on the cryptocurrency, highlighting the interplay between celebrity influence and crypto market dynamics.
Topic 5: Updates and Discussions on New Cryptocurrencies
This topic appears to gather daily updates and discussions on new and emerging cryptocurrencies. It captures the excitement and speculation surrounding new entries into the market, with a particular focus on how they compare or contrast with established players like Dogecoin.
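One way to check informed guesses like these is to read the posts that the model assigns most strongly to each topic. The sketch below assumes each post already has a topic mixture, such as the `topicDistribution` vector that a fitted pyspark LDA model adds on `transform`; the post identifiers and weights are hypothetical values for illustration only.

```python
def dominant_topic(topic_distribution):
    """Return (topic index, weight) of the largest entry in a topic mixture."""
    index = max(range(len(topic_distribution)), key=lambda i: topic_distribution[i])
    return index, topic_distribution[index]


# Hypothetical topic mixtures for three posts over five topics.
post_mixtures = {
    "post_a": [0.70, 0.10, 0.10, 0.05, 0.05],
    "post_b": [0.05, 0.10, 0.65, 0.10, 0.10],
    "post_c": [0.10, 0.10, 0.10, 0.60, 0.10],
}

# Group post ids by their dominant topic so each group can be read by hand.
posts_by_topic = {}
for post_id, mixture in post_mixtures.items():
    topic, _weight = dominant_topic(mixture)
    posts_by_topic.setdefault(topic, []).append(post_id)

print(posts_by_topic)
```

Reading a handful of posts per group gives a direct check on whether a label such as "wallets and brokers" matches what the posts actually say.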
Table 4: Topic modeling results for comments
Topic | Topic Words | Summary |
0 | remov , doge , verifi , +usodogetip , ampxb , yes , let , on , yeah , dogecoin | Other meme tokens |
1 | | Banano token; belief in the Dogecoin price |
2 | doge , crypto , dont , people , year , get , buidl , go , monei , coin | Encourages users to hold crypto and not sell |
The comments differ from the posts in that they contain only three main topics. This is understandable, since comments usually just express an attitude toward the post content. In general, they reflect bullish sentiment about the crypto market and discussion of other meme coins. Some prominent meme coins may be mentioned so often that they can be considered an individual topic.